Abstract: In this paper, both non-mixing and mixing local
minima of the entropy are analyzed from the viewpoint of
blind source separation (BSS); they correspond respectively to
acceptable and spurious solutions of the BSS problem. The
contribution of this work is twofold. First, a Taylor expansion
is used to show that the exact output entropy cost function has
a non-mixing minimum when this output is proportional to any 
of the non-Gaussian sources, and not only when the output is 
proportional to the lowest entropic source. Second, in order to 
prove that mixing entropy minima exist when the source densities 
are strongly multimodal, an entropy approximator is proposed. 
The latter has the major advantage that an error bound can be 
provided. Even if this approximator (and the associated bound) 
is used here in the BSS context, it can be applied for estimating 
the entropy of any random variable with multimodal density. 

Index Terms: Blind source separation, independent component
analysis, entropy estimation, multimodal densities, mixture
distribution.

EDICS Category: 



I. Introduction 

Blind source separation (BSS) aims at recovering a vector
of independent sources $\mathbf{S} = [S_1, \dots, S_K]^T$ from observed
mixtures $\mathbf{X} = [X_1, \dots, X_M]^T$. In this paper, we assume that
$K = M$ and $\mathbf{X} = \mathbf{A}\mathbf{S}$, where $\mathbf{A}$ is the $K$-by-$K$ mixing
matrix. The sources can be recovered by finding an unmixing
matrix $\mathbf{B}$ such that $\mathbf{W} = \mathbf{B}\mathbf{A}$ is non-mixing (i.e. with one
non-zero entry per row and per column). Such matrices $\mathbf{B}$ can
be found by minimizing an ad-hoc cost function (see [1], the
books [2], [3], [4] and references therein).
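As a minimal illustration of the non-mixing condition, the following Python sketch checks whether a global matrix W = BA has exactly one non-zero entry per row and per column. The function name `is_non_mixing`, the tolerance `tol` and the toy matrices are assumptions introduced here for illustration only; they do not come from the cited works.

```python
import numpy as np

def is_non_mixing(W, tol=1e-9):
    """Return True if W has exactly one non-zero entry per row and per column.

    `tol` is a hypothetical threshold below which an entry counts as zero.
    """
    W = np.asarray(W, dtype=float)
    nonzero = np.abs(W) > tol
    one_per_row = np.all(nonzero.sum(axis=1) == 1)
    one_per_col = np.all(nonzero.sum(axis=0) == 1)
    return bool(one_per_row and one_per_col)

# Toy 3-source example: A is an arbitrary invertible mixing matrix.
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])
# A scaled permutation of A^{-1} recovers the sources up to order and scale.
P = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.5, 0.0, 0.0]])
B_good = P @ np.linalg.inv(A)
B_bad = np.eye(3)  # leaves the mixtures untouched, so W = A is mixing

W_good = B_good @ A
W_bad = B_bad @ A
```

Here `is_non_mixing(W_good)` holds because W_good is a scaled permutation matrix, whereas `W_bad` equals the mixing matrix itself.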

In practice, the minimum of these criteria is reached by
adaptive methods such as gradient descent. Therefore, one
has to pay attention to the solutions corresponding to these
minima. In most cases, the global minimum is a solution of
the BSS problem. By contrast, the possible local minima can
correspond either to a desired solution (referred to as non-mixing
minima) or to a spurious solution (referred to as mixing minima) of
the problem. For example, the optimization algorithm could
be trapped in minima that do not correspond to an acceptable
solution of the BSS problem. Therefore, it is of interest to
study the possible existence of both non-mixing and mixing
local minima.

The paper deals with this issue by extending existing results 
of related work. The introduction first presents the two main 
approaches for source separation and details the state of the
art related to the local minima of BSS criteria. Then, the
objectives and the organization of the paper are presented.



A. Symmetric and deflation approaches 

To determine matrix B, two approaches can be investigated. 
The first one (called symmetric) aims at extracting all sources 
simultaneously. The second approach (called deflation) ex- 
tracts the sources one by one. 

• The common symmetric approach consists in minimizing
the Kullback-Leibler divergence between the joint density
and the product of the marginal densities of the recovered
sources (i.e. their mutual information), which are the
components $Y_1, \dots, Y_K$ of $\mathbf{Y} = \mathbf{B}\mathbf{X}$. This leads to the
minimization of (see [5], [6], [7])

$$C(\mathbf{B}) = \sum_{k=1}^{K} H(Y_k) - \log|\det \mathbf{B}| \qquad (1)$$

where $H(Y)$ denotes Shannon's differential entropy of
$Y$ [5], [6]:

$$H(Y) = -\int p_Y(y) \log(p_Y(y))\,dy \qquad (2)$$

In eq. (2), $p_Y$ denotes the probability density function
(pdf) of $Y$. A variant of this approach applies the unmixing
matrix $\mathbf{B}$ to a whitened version of the observations.
In this case, since the sources are uncorrelated and can be
assumed to have the same variance, one can constrain $\mathbf{B}$
to be orthogonal [2]. The term $\log|\det \mathbf{B}|$ in criterion (1)
disappears and $C(\mathbf{B})$ is to be minimized over the group
of orthogonal matrices.
• The deflation approach [8] extracts the $k$-th source by
computing the $k$-th row $\mathbf{b}_k$ of $\mathbf{B}$ by minimizing a
non-Gaussianity index of $\mathbf{b}_k\mathbf{X}$ subject to the constraint that
$\mathbf{b}_k\mathbf{X}$ is uncorrelated with $\mathbf{b}_i\mathbf{X}$ for $i < k$. By taking this
index to be the negentropy [9] and assuming (without loss
of generality) that the sources have the same variance,
the cost function can be written as $H(\mathbf{w}_k\mathbf{S}) - \log\|\mathbf{w}_k\|$
plus a constant, where $\mathbf{w}_k = \mathbf{b}_k\mathbf{A}$ and $\|\mathbf{w}_k\|$ denotes the
Euclidean norm $\sqrt{\mathbf{w}_k\mathbf{w}_k^T}$ [10], [11]. Since this function
is unchanged when $\mathbf{w}_k$ is multiplied by a scalar, this
leads to minimizing $H(\mathbf{w}_k\mathbf{S})$ under the $\mathbf{w}_i\mathbf{w}_k^T = \delta_{i,k}$
constraint for $1 \le i, k \le K$, where $\delta_{i,k}$ is the Kronecker
delta [12].

B. Related works 

Although both symmetric and deflation procedures could be 
analyzed in this contribution with the same tools, we focus on 
the entropy H(Yk), used in the deflation approach. 
Several results exist regarding the entropy minima of $Y = \mathbf{w}\mathbf{S}$
(the subscript "$k$" has been omitted in the following, since
one signal is extracted at a time in the deflation approach). The
first kind of results discusses the existence of non-mixing local
minima of $H(Y)$ that correspond to the extraction of a single
source. The second kind of results discusses the existence of
mixing minima that correspond to spurious solutions of the
BSS problem: $Y$ is still a mixture of sources despite the fact
that $H(Y)$ is a local minimum. These results are summarized
below.

• Non-mixing entropy local minima 

It has been shown that the global minimum of H(Y) with 
Y = wS is reached when the output Y is proportional to 
the source with the lowest entropy [10]. It is proven in [9] 
that when a fixed-variance output is proportional to one 
of the sources, then, under some technical conditions, the 
cumulant-based approximation of entropy Hj(Y) used 
in FastICA [9] reaches a non-mixing local minimum. 
Finally, based on the entropy power inequality [13], it 
is also proven in [14] that, in the two-dimensional case, 
Shannon's entropy has a local minimum when the output 
is proportional to a non-Gaussian source. 

• Mixing entropy local minima 

As for the mutual information, simulation results in [15]
suggest that mixing local entropy minima exist in specific
cases (i.e. when the source pdfs are strongly multimodal,
which sometimes occurs in practice, for sinusoid waveforms
among others). These results, based on density estimation
using the Parzen kernel method, are confirmed by
other simulations that use entropy estimation directly, such
as Vasicek's estimator in [16] or the approximator
analyzed in this paper in [17]. Rigorously speaking, the
above results do not constitute an absolute proof since
error bounds are not available for the approximation
procedure. By contrast, a theoretical proof is given in
[18], but for a specific example only (two bimodal
sources sharing the same symmetric pdf). The existence
of mixing local entropy minima has also been shown
in [19] (without detailed proof) in the case of two
non-symmetric sources with strongly multimodal pdfs.

C. Objectives and organization of the paper 

In this paper, additional results regarding mixing and non- 
mixing entropy minima are presented. Two main results will 
be proven. 

Firstly, it will be shown in the next section that the exact 
entropy of an output H(Y) with a fixed variance has local
non-mixing minima: the entropy H(Y) has a local minimum when
Y is proportional to one of the non-Gaussian sources. This is 
an extension of the results presented in [18] to the case of K > 
2 sources. If the output is proportional to the Gaussian source 
(if it exists), the entropy has a global maximum. Numerical 
simulations illustrate these results in the K = 2 case, for the 
ease of illustration. 

Secondly, in Section III, an entropy approximator is presented,
for which an error bound can be derived. It is suitable
for variables having multimodal densities with modes having a
low overlap, in the sense that its error bound converges to zero
when the mode overlap becomes negligible. This approximator 
was mentioned in [17] and error bounds have been provided 
in [19] without proof. In the BSS context, when the sources 
have such densities, the use of this approximator makes it 
possible to show that the marginal entropy has local mixing 
minima. This approach can be applied to a wider class of 
source densities than the score function-based method derived 
in [18]. The results presented in this paper further extend those 
in [19] as they are not restricted to the case of K = 2 sources. 
Finally, we provide a detailed proof of the bound formula for 
the entropy approximator. 

It must be stressed that the aforementioned entropy approx- 
imator can be used for other applications that require entropy 
estimation of multimodal densities. 



II. Local non-mixing minima of output entropy 

In this section, we shall prove that $H(\mathbf{w}\mathbf{S})$, under the
$\|\mathbf{w}\| = 1$ constraint, reaches a local minimum at $\mathbf{w} = \mathbf{I}_j$,
the $j$-th row of the $K \times K$ identity matrix, if $S_j$ is
non-Gaussian, or a global maximum otherwise. Note that, as is
well known, the global minimum is reached at $\mathbf{I}_j$, where
$j = \arg\min_k H(S_k)$.



A. Theoretic development 

The starting point is an expansion of the entropy of a
random variable $Y$ slightly contaminated with another variable
$\delta Y$ up to second order in $\delta Y$, which has been established
in [20]:

$$H(Y + \delta Y) \approx H(Y) + E[\psi_Y(Y)\,\delta Y] + \frac{1}{2}\Big\{E[\mathrm{var}(\delta Y|Y)\,\psi_Y'(Y)] - E\big([E(\delta Y|Y)]'^2\big)\Big\} \qquad (3)$$

In this equation, $\psi_Y$ is the score function of $Y$, defined as
$-(\log p_Y)'$¹, $p_Y$ is the pdf of $Y$, $'$ denotes the derivative, and
$E(\cdot|Y)$ and $\mathrm{var}(\cdot|Y)$ denote the conditional expectation and
conditional variance given $Y$, respectively.

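Assume that $\mathbf{w}$ is close to $\mathbf{I}_j$ so that its $i$-th component
$w_i$ is close to 0 for $i \neq j$. Under the $\|\mathbf{w}\| = 1$ constraint,
$w_j = \sqrt{1 - \sum_{i \neq j} w_i^2}$ and since $\sqrt{1 - x} = 1 - x/2 + o(x)$,
one can write

$$w_j = 1 - \frac{1}{2}\sum_{i \neq j} w_i^2 + o\Big(\sum_{i \neq j} w_i^2\Big).$$

Thus $\mathbf{w}\mathbf{S} = S_j + \delta S_j$ with

$$\delta S_j = \sum_{i \neq j} w_i S_i - \frac{1}{2}\Big(\sum_{i \neq j} w_i^2\Big) S_j + o\Big(\sum_{i \neq j} w_i^2\Big).$$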

¹ In this paper, we use the score function definition presented in [7].
However, several authors define this function with the opposite sign. The
reader should keep this difference in mind.
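Therefore, applying (3) and dropping higher order terms, one
gets that $H(\mathbf{w}\mathbf{S})$ equals

$$H(S_j) + \sum_{i \neq j} w_i\,E[\psi_{S_j}(S_j) S_i] - \frac{1}{2}\Big(\sum_{i \neq j} w_i^2\Big) E[\psi_{S_j}(S_j) S_j]$$
$$+ \frac{1}{2}\Big\{E\Big[\mathrm{var}\Big(\sum_{i \neq j} w_i S_i \Big| S_j\Big)\psi_{S_j}'(S_j)\Big] - E\Big(\Big[\sum_{i \neq j} w_i E(S_i|S_j)\Big]'^2\Big)\Big\}$$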

Since the sources are mutually independent, any non-linear
mappings of them are uncorrelated, so that $E[\psi_{S_j}(S_j) S_i] = 0$
for $i \neq j$. Furthermore $E(S_i|S_j) = E(S_i) = 0$ for
$i \neq j$, $E[\psi_{S_j}(S_j) S_j] = 1$ (by integration by parts), and

$$\mathrm{var}\Big(\sum_{i \neq j} w_i S_i \Big| S_j\Big) = \sigma_S^2 \sum_{i \neq j} w_i^2$$

where $\sigma_S^2$ denotes the common variance of the sources.
Therefore

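$$H(\mathbf{w}\mathbf{S}) = H(S_j) + \frac{1}{2}\Big(\sum_{i \neq j} w_i^2\Big)\big\{\sigma_S^2 E[\psi_{S_j}'(S_j)] - 1\big\} + o\Big(\sum_{i \neq j} w_i^2\Big). \qquad (4)$$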

Note that again by integration by parts, $E[\psi_{S_j}'(S_j)]$ can be
rewritten as $E[\psi_{S_j}^2(S_j)]$, which is precisely Fisher's
information [5]. In addition, by Schwarz's inequality [5], one has

$$\big|E\{[S_j - E(S_j)]\,\psi_{S_j}(S_j)\}\big| \le \sqrt{\sigma_S^2\,E[\psi_{S_j}^2(S_j)]}$$

with equality if and only if $\psi_{S_j}$ is a linear function. But since
$E[\psi_{S_j}(S_j)] = 0$ and, as mentioned above, $E[S_j\psi_{S_j}(S_j)] = 1$,
the left hand side of the above inequality equals 1. Thus
$\sigma_S^2 E[\psi_{S_j}^2(S_j)] > 1$ unless $\psi_{S_j}$ is linear (which means that $S_j$
is Gaussian), in which case $\sigma_S^2 E[\psi_{S_j}^2(S_j)] = 1$. One concludes
from (4) that $H(\mathbf{w}\mathbf{S}) > H(S_j)$ for all $\mathbf{w}$ sufficiently close
to $\mathbf{I}_j$ if $S_j$ is non-Gaussian. Thus $H(\mathbf{w}\mathbf{S})$ reaches local
non-mixing minima at $\mathbf{w} = \pm\mathbf{I}_j$ (since $H(-\mathbf{w}\mathbf{S}) = H(\mathbf{w}\mathbf{S})$), as
long as $S_j$ is non-Gaussian. If $S_j$ is Gaussian then $H(S_j)$ is
a global maximum since Gaussian random variables have the
highest entropy for a given variance. Equality (4) is of no use
in this case, since the second term in this equality vanishes.
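The key quantity in the argument above, the product of the variance and Fisher's information, can be checked numerically. The following Python sketch evaluates it with finite differences and Riemann sums for a Gaussian density and for a logistic density. The logistic source, the grid and the function names are assumptions chosen here for illustration; they are not taken from the paper.

```python
import numpy as np

def variance_times_fisher(pdf, x):
    """Approximate var(S) * E[psi_S(S)^2] on a uniform grid x by Riemann sums."""
    dx = x[1] - x[0]
    p = pdf(x)
    dp = np.gradient(p, dx)                # numerical derivative p'
    mask = p > 1e-300                      # skip far tails where p underflows
    # E[psi^2] = integral of p'^2 / p, since psi = -p'/p
    fisher = np.sum(dp[mask] ** 2 / p[mask]) * dx
    mean = np.sum(x * p) * dx
    var = np.sum((x - mean) ** 2 * p) * dx
    return float(var * fisher)

def gaussian_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def logistic_pdf(x):
    # Standard logistic density, written with exp(-|x|) to avoid overflow.
    e = np.exp(-np.abs(x))
    return e / (1 + e) ** 2

x = np.linspace(-40.0, 40.0, 400001)
gauss_value = variance_times_fisher(gaussian_pdf, x)        # close to 1
logistic_value = variance_times_fisher(logistic_pdf, x)     # close to pi^2 / 9
```

For the standard logistic density the variance is $\pi^2/3$ and Fisher's information is $1/3$, so the product is $\pi^2/9 > 1$, whereas the Gaussian attains the lower limit 1. By (4), the second-order term is therefore strictly positive for the logistic source and vanishes for the Gaussian one.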

B. Numerical simulations 

In this subsection, three simple examples are analyzed in
the $K = 2$ case. In this case, the unit-norm vector $\mathbf{w}$ can
be rewritten as $[\sin\theta, \cos\theta]$ and $H(\mathbf{w}\mathbf{S})$ is considered as a
function of $\theta$. The entropy is computed through eq. (2), in
which the pdfs were estimated from a finite sample set (1000
samples), using Parzen density estimation [21], [22] with
Gaussian kernels of standard deviation $\sigma_K = 0.5\,\sigma_X \times [\ldots]$
($S$ denotes the number of samples and $\sigma_X$ is the empirical
standard deviation, enforced to be equal to one here) and
Riemann summation instead of exact integration.
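The estimation procedure just described can be sketched as follows in Python. This is only an illustrative reconstruction: the kernel bandwidth (0.2), the grid size, the random seed and the function names are assumptions, and the bandwidth rule actually used for the figures is not reproduced here.

```python
import numpy as np

def parzen_entropy(samples, bandwidth, n_grid=2000):
    """Estimate H(Y) = -int p log p using a Gaussian Parzen window and a Riemann sum."""
    samples = np.asarray(samples, dtype=float)
    lo = samples.min() - 5 * bandwidth
    hi = samples.max() + 5 * bandwidth
    y = np.linspace(lo, hi, n_grid)
    dy = y[1] - y[0]
    # Parzen estimate: average of Gaussian kernels centred on the samples.
    z = (y[:, None] - samples[None, :]) / bandwidth
    p = np.exp(-0.5 * z ** 2).sum(axis=1) / (samples.size * bandwidth * np.sqrt(2 * np.pi))
    # Convention 0 log 0 = 0 wherever the estimate underflows to zero.
    integrand = np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return float(np.sum(integrand) * dy)

def entropy_vs_angle(S, thetas, bandwidth):
    """Entropy estimate of wS for w = [sin(theta), cos(theta)]; S has shape (2, n)."""
    return np.array([parzen_entropy(np.sin(t) * S[0] + np.cos(t) * S[1], bandwidth)
                     for t in thetas])

rng = np.random.default_rng(0)
n = 1000
# Two unit-variance uniform sources (an illustrative choice of source densities).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
thetas = np.linspace(0.0, np.pi, 37)             # 5-degree steps
H = entropy_vs_angle(S, thetas, bandwidth=0.2)   # assumed bandwidth
```

Plotting `H` against `thetas` gives a curve of the kind discussed in the examples below. Such an estimate is affected by both the kernel smoothing and the finite sample size, which is why a curve obtained this way cannot by itself prove the existence of a local minimum.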

Example 1: Assume that $S_1$ and $S_2$ have uniform densities.
According to the above results, local minima exist for
$\theta \in \{p\pi/2 \mid p \in \mathbb{Z}\}$. In this example, no mixing minimum can be
observed (Fig. 1(a)).




Fig. 1. Evolution of $H(\mathbf{w}\mathbf{S})$ vs $\theta$: (a) Example 1: two uniform sources;
(b) Example 2: uniform ($S_1$) and Gaussian ($S_2$) sources; (c) Example 3:
two bimodal sources. The non-mixing minima are indicated by dash-dotted
vertical lines, the mixing ones by dotted lines.



Example 2: Suppose now that $S_1$ and $S_2$ have uniform and
Gaussian distributions respectively. Local minima are found
for $\theta \in \{(2p+1)\pi/2\}$, $p \in \mathbb{Z}$, and local maxima for $\theta \in \{p\pi\}$
(Fig. 1(b)). Again, no spurious minimum can be observed in
this example.

Example 3: Consider two symmetric source pdfs $p_{S_1}$ and
$p_{S_2}$ that consist of i) two non-overlapping uniform
modes and ii) two Gaussian modes with negligible overlap,
respectively. One can observe that non-mixing solutions occur
for $\theta \in \{p\pi/2\}$ (Fig. 1(c)).

In addition to an illustration of the above theoretical result,
the last example shows the existence of spurious (mixing)
local minima for $\theta \notin \{p\pi/2\}$. However, the figure does
not constitute a proof of the existence of local minima of
$H(\mathbf{w}\mathbf{S})$; the minima visible on the figure could indeed be
a consequence of the entropy estimator (more precisely, of
the pdf estimation). In the next section, we derive an entropy
approximator and an associated error bound. This approximator
is efficient for estimating the entropy of variables having
multimodal densities, in the sense that the error bound tends
to zero when the mode overlaps decrease. Next, thanks to this
approximator, it will be theoretically proven that mixing local
minima exist for strongly multimodal source densities.

III. Entropy approximator 

In this section, we introduce the entropy approximator 
first derived in [17]. The detailed proofs of the upper and 
lower bounds of the entropy based on this approximator, 
already mentioned in [19] without proof, are given. Illustrative 
examples are further provided. The entropy bounds will be 
used in the next section to prove that for a specific class of 
source distributions, the entropy function H(wS) can have 
a local minimum that does not correspond to a row of the 
identity matrix. The presented approach yields more general
results than those in [18], since the sources are no longer
required to share a common symmetric pdf.

This approach relies on an entropy approximation of a
multimodal pdf of the form

$$p(y) = \sum_{n=1}^{N} \pi_n K_n(y), \qquad (5)$$

where $N > 1$, $\pi_1, \dots, \pi_N$ are (strictly positive) probabilities
summing to 1 and $K_1, \dots, K_N$ are unimodal pdfs. We focus
on the case where the supports of the $K_n$ can be nearly
covered by disjoint subsets $\Omega_n$ ($n = 1, \dots, N$) so that $p$ is
strongly multimodal (with $N$ modes). In this case a good
approximation to the entropy of a random variable of density
$p$ can be obtained; with a slight abuse of notation, this entropy
will be denoted by $H(p)$ instead of $H(Y)$, where $Y$ is a random
variable with pdf $p$. This approximation will first be derived
informally (for ease of comprehension) and then a formal
development giving the error bounds of the approximator is provided.

A. Informal derivation of entropy approximator 

If the random variable has a pdf of the form (5), then its
entropy equals

$$H(p) = -\int_{-\infty}^{\infty} \sum_{n=1}^{N} \pi_n K_n(y) \log\Big[\sum_{n=1}^{N} \pi_n K_n(y)\Big] dy. \qquad (6)$$

Suppose that there exist disjoint sets $\Omega_1, \dots, \Omega_N$ that
nearly cover the supports of the $K_n$ densities; even if the $K_n$
have a finite support, the $\Omega_n$ may differ from the true supports
of the $K_n$ since these supports may not be disjoint. Then,
assuming that $\pi_n K_n(y) \ge 0$ is small or zero for all $y \notin \Omega_n$
and noting that $0 \log 0 = 0$ by convention (more rigorously:
$\lim_{x \to 0^+} x \log x = 0$), one gets

$$H(p) \approx -\sum_{m=1}^{N} \int_{\Omega_m} \sum_{n=1}^{N} \pi_n K_n(y) \log\Big[\sum_{n=1}^{N} \pi_n K_n(y)\Big] dy \approx -\sum_{m=1}^{N} \pi_m \int_{\Omega_m} K_m(y) \log[\pi_m K_m(y)]\,dy.$$

If we denote $\pi = [\pi_1, \dots, \pi_N]$ and by $h(\pi) = -\sum_{n=1}^{N} \pi_n \log \pi_n$
the entropy of a discrete random variable taking $N$ distinct
values with probabilities $\pi_1, \dots, \pi_N$, then $H(p) \approx \mathcal{H}(p)$
where

$$\mathcal{H}(p) = \sum_{n=1}^{N} \pi_n H(K_n) + h(\pi). \qquad (7)$$


B. Upper and lower bounds of the entropy of a multimodal 
distribution 

The entropy approximator $\mathcal{H}(p)$ of the previous subsection is
actually an upper bound for the entropy. This claim is proved
in the following; in addition, a lower bound of the entropy
will be provided. These bounds make it possible to analyze
how accurate the approximation $H(p) \approx \mathcal{H}(p)$ is; they are
explicitly computed when all $K_n$ are Gaussian kernels.



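1) General results: The following Lemma provides upper
and lower bounds for the entropy.

Lemma 1: Let $p$ be given by (5), then

$$H(p) \le \mathcal{H}(p) \qquad (8)$$

where $\mathcal{H}(p)$ is given by (7).

In addition, assume that $\sup K_n = \sup_{y \in \mathbb{R}} K_n(y) < \infty$
($1 \le n \le N$) and let $\Omega_1, \dots, \Omega_N$ be disjoint subsets which
approximately cover the supports of $K_1, \dots, K_N$, in the sense
that

$$\epsilon_n = \int_{\mathbb{R} \setminus \Omega_n} K_n(y)\,dy, \qquad \epsilon'_n = \int_{\mathbb{R} \setminus \Omega_n} K_n(y) \log\frac{\sup K_n}{K_n(y)}\,dy$$

are small. Then, we have

$$H(p) \ge \mathcal{H}(p) - \sum_{n=1}^{N} \pi_n \epsilon'_n - \sum_{n=1}^{N} \pi_n \Big[\log\Big(\frac{\max_{1 \le m \le N} \sup K_m}{\pi_n \sup K_n}\Big) + 1\Big] \epsilon_n. \qquad (9)$$

The proof of this Lemma is given in Appendix I.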
Let us consider now the case where the densities $K_n$ in (5)
all have the same form:

$$K_n(y) = (1/\sigma_n)\,K[(y - \mu_n)/\sigma_n] \qquad (10)$$

where $K$ is a bounded density of finite entropy. Hence
$H(K_n) = H(K) + \log\sigma_n$ and the upper bound (8) becomes

$$H(p) \le \mathcal{H}(p) = H(K) + \sum_{n=1}^{N} \pi_n \log\sigma_n + h(\pi). \qquad (11)$$

Also, since $\sup K_n = \sup K / \sigma_n$, the lower bound of the entropy
given by eq. (9) reduces to

$$H(p) \ge \mathcal{H}(p) - \sum_{n=1}^{N} \pi_n \Big\{\epsilon'_n + \Big[\log\Big(\frac{\sigma_n}{\pi_n \min_{1 \le m \le N} \sigma_m}\Big) + 1\Big] \epsilon_n\Big\}. \qquad (12)$$

Let us arrange the $\mu_n$ by increasing order and take $\sigma_n$ small
with respect to

$$d_n = \min(\mu_n - \mu_{n-1}, \mu_{n+1} - \mu_n) \qquad (13)$$

where $\mu_0 = -\infty$ and $\mu_{N+1} = \infty$ by convention. Under this
assumption, the density (5) is strongly multimodal and $\Omega_n$ in
the above Lemma can be taken to be intervals centered at $\mu_n$
of length $d_n$:

$$\Omega_n = (\mu_n - d_n/2,\ \mu_n + d_n/2). \qquad (14)$$

Then simple calculations give

$$\epsilon_n = 1 - \int_{-d_n/(2\sigma_n)}^{d_n/(2\sigma_n)} K(x)\,dx,$$
$$\epsilon'_n = H(K) - H_{d_n/\sigma_n}(K) + \epsilon_n \log(\sup K),$$

where $H_\alpha(K) = -\int_{-\alpha/2}^{\alpha/2} K(x) \log K(x)\,dx$. It is clear that
$\epsilon_n$ and $\epsilon'_n$ both tend to 0 as $d_n/\sigma_n \to \infty$. Thus one gets the
following corollary.

Corollary 1: Let $p$ be given by (5) with $K_n$ of the form (10)
and $\sup_x K(x) < \infty$. Then $H(p)$ is bounded above by $\mathcal{H}(p)$
and converges to this bound as $\min_n(d_n/\sigma_n) \to \infty$, $d_n$ being
defined in (13).
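2) Explicit calculation in the Gaussian case: Let us focus
on the $K(x) = \Phi(x)$ case where $\Phi(x)$ denotes the standard
Gaussian density: $\Phi(x) = (1/\sqrt{2\pi})\,e^{-x^2/2}$.

The upper and lower bounds of $H(p)$ are given by (11)
and (12) with $H(\Phi)$ instead of $H(K)$; $\epsilon_n$ and $\epsilon'_n$ can now be
obtained explicitly:

$$\epsilon_n = \mathrm{Erfc}\Big(\frac{d_n}{2\sqrt{2}\,\sigma_n}\Big),$$
$$\epsilon'_n = H(\Phi) - H_{d_n/\sigma_n}(\Phi) - \epsilon_n \log\sqrt{2\pi},$$

where Erfc is the complementary error function defined as
$\mathrm{Erfc}(x) = (2/\sqrt{\pi}) \int_x^{\infty} \exp(-z^2)\,dz$. By double integration by
parts and noting that $\int \mathrm{Erf}(x)\,dx = x\,\mathrm{Erf}(x) + \exp(-x^2)/\sqrt{\pi}$
with $\mathrm{Erf}(x) = 1 - \mathrm{Erfc}(x)$, some algebraic manipulations give

$$H_{d_n/\sigma_n}(\Phi) = \frac{1}{2}\log(2\pi e)\,\mathrm{Erf}\Big(\frac{d_n}{2\sqrt{2}\,\sigma_n}\Big) - \frac{d_n}{2\sqrt{2\pi}\,\sigma_n}\,e^{-d_n^2/(8\sigma_n^2)}.$$

One can see that $H_{d_n/\sigma_n}(\Phi) \to H(\Phi) = \frac{1}{2}\log(2\pi e)$ as
$d_n/\sigma_n \to \infty$, as it should be. Finally:

$$\epsilon_n = \mathrm{Erfc}\Big(\frac{d_n}{2\sqrt{2}\,\sigma_n}\Big),$$
$$\epsilon'_n = \frac{1}{2}\,\mathrm{Erfc}\Big(\frac{d_n}{2\sqrt{2}\,\sigma_n}\Big) + \frac{d_n}{2\sqrt{2\pi}\,\sigma_n}\,e^{-[d_n/(2\sqrt{2}\,\sigma_n)]^2}.$$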
Example 4: To illustrate Corollary 1, Fig. 2 plots the
entropy of a trimodal variable $Y$ with density $p$ as in (5) with $K_n$
given by (10), $\sigma_n = \sigma$ (for ease of illustration), $K = \Phi$,
$\mu = [0, 5, 10]$ and $\pi = [1/4, 1/2, 1/4]$. Such a variable can be
represented as $Y = U + \sigma Z$ where $U$ is a discrete random
variable taking values in $\{0, 5, 10\}$ with probabilities $1/4, 1/2, 1/4$
and $Z$ is a standard Gaussian variable independent from $U$.
The upper and lower bounds of the entropy are computed as in
Lemma 1 with the above expressions for $\epsilon_n$, $\epsilon'_n$, and plotted on
the same figure. One can see that the lower $\sigma$ is, the better the
approximation of $H(Y)$ by its upper and lower bounds. On the
contrary, when $\sigma$ increases, the differences between the entropy
and its bounds tend to increase, which seems natural. These
differences, however, can be seen to tend towards a constant for
$\sigma \to \infty$. This can be explained as follows. When $\sigma$ is large, $p$
is no longer multimodal and tends to the Gaussian density of
variance $\sigma^2$. Thus $H(Y)$ grows with $\sigma$ as $\log\sigma$. On the other
hand, the upper bound $\mathcal{H}(p)$ of $H(Y)$ also grows as $\log\sigma$.
The same is true for the lower bound of $H(Y)$, which equals
$\mathcal{H}(p) - \sum_{n=1}^{N} \pi_n[\epsilon'_n + \epsilon_n(\log\pi_n^{-1} + 1)]$: the last term tends to
$h(\pi) + 3/2$ as $\sigma \to \infty$ since for fixed $d_n$, $\epsilon_n \to 1$ and $\epsilon'_n \to 1/2$
as $\sigma \to \infty$.
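The behaviour described in Example 4 can be reproduced with a short Python sketch. It evaluates the Gaussian-case expressions of $\epsilon_n$ and $\epsilon'_n$ for modes of common standard deviation, and compares the resulting bounds with a numerical evaluation of $H(Y)$. The function names, the integration grid and the tested values of $\sigma$ are assumptions made for this illustration.

```python
import math
import numpy as np

H_PHI = 0.5 * math.log(2 * math.pi * math.e)   # entropy of the standard Gaussian

def entropy_bounds(mu, pi, sigma):
    """Lower and upper bounds of Lemma 1 for Gaussian modes of common std sigma.

    mu must be sorted in increasing order and contain at least two modes.
    """
    mu = np.asarray(mu, dtype=float)
    pi = np.asarray(pi, dtype=float)
    gaps = np.diff(mu)
    # d_n of eq. (13), with mu_0 = -inf and mu_{N+1} = +inf.
    d = np.minimum(np.r_[np.inf, gaps], np.r_[gaps, np.inf])
    a = d / (2 * math.sqrt(2) * sigma)
    eps = np.array([math.erfc(v) for v in a])
    eps_prime = 0.5 * eps + d / (2 * math.sqrt(2 * math.pi) * sigma) * np.exp(-a ** 2)
    h_pi = -np.sum(pi * np.log(pi))
    upper = H_PHI + math.log(sigma) + h_pi      # eq. (11) with sigma_n = sigma
    lower = upper - np.sum(pi * (eps_prime + eps * (np.log(1 / pi) + 1)))
    return float(lower), float(upper)

def mixture_entropy(mu, pi, sigma, n_grid=200001):
    """Numerical value of H(p) for the Gaussian mixture, by a Riemann sum."""
    y = np.linspace(min(mu) - 12 * sigma, max(mu) + 12 * sigma, n_grid)
    p = np.zeros_like(y)
    for m, w in zip(mu, pi):
        p += w * np.exp(-(y - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    # Convention 0 log 0 = 0 where the density underflows between the modes.
    integrand = np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return float(np.sum(integrand) * (y[1] - y[0]))

mu = [0.0, 5.0, 10.0]
pi = [0.25, 0.5, 0.25]
results = {s: (entropy_bounds(mu, pi, s), mixture_entropy(mu, pi, s))
           for s in (0.2, 1.0, 3.0)}
```

For small `sigma` the two bounds in `results` nearly coincide with the numerical entropy, while for larger `sigma` the gap between them widens, in line with the discussion above.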



Fig. 2. Illustration of Example 4: evolution of $H(Y)$ and its bounds versus
$\sigma$, where $Y = U + \sigma Z$, $U$ is a discrete random variable taking values
in $\{0, 5, 10\}$ with probabilities $\pi = [1/4, 1/2, 1/4]$ and $Z$ is a standard
Gaussian variable independent from $U$. The plotted curves are $H(Y)$, the
upper bound, the lower bound and the difference between the upper and
lower bounds. The lower bound converges to the upper bound as $\sigma \to 0$
and the difference between upper and lower bounds tends to
$3/2 + h(\pi)$ as $\sigma \to \infty$ (note that the horizontal axis scale is
logarithmic).

C. Entropy bounds and decision theory

The entropy approximator given in eq. (7) actually has close
connections with decision problems, and a tighter upper bound
for $H(p)$ can be found in this framework. Assume we have an
$N$-class classification problem consisting in finding the class
label $C$ of an observation $y$, knowing the densities and the
priors of the classes. In such classification problems,
one is often interested in quantifying the Bayes probability of
error $P(e)$. In our context, each pdf mode $K_n$ represents
the density of a given class $c_n$, i.e. the conditional density
of $Y$ given $C = c_n$ is $K_n$. Furthermore, $\pi_n$ is the a priori
probability of $c_n$: $P(C = c_n) = \pi_n$, and $p$ is the density of
$Y$, which can thus be seen as a "mixture density". Defining
$h(C) = -\sum_{n=1}^{N} P(C = c_n) \log P(C = c_n)$, it can be shown
[23], [24] that

$$P(e) \le \frac{1}{2} h(C|Y) = \frac{1}{2}\big[H(Y|C) + h(C) - H(Y)\big] = \frac{1}{2}\Big[\sum_{n=1}^{N} \pi_n H(K_n) + h(\pi) - H(Y)\Big] \qquad (15)$$

where $H(Y|C) = E_C[H(Y|C = c)]$, which shows that half
the difference between $\mathcal{H}(p)$ and $H(p)$ is precisely an
upper bound of Bayes' probability of error
$P(e) = E_Y[1 - \max_i p(c_i|Y)]$. The error vanishes when the modes have no
overlap (the classes are separable, i.e. disjoint).

Clearly, $\mathcal{H}(p) - 2P(e)$ is a tighter upper bound of $H(p)$
than $\mathcal{H}(p)$ as $P(e) \ge 0$. On the other hand, it can be proved
that $\mathcal{H}(p) - 2\sqrt{(N-1)P(e)}$ is a lower bound for $H(p)$ [24].
However, the lower bound in Lemma 1 is tighter when $\sigma$ is
small enough. Both bounds in this lemma are easier to deal
with in more general theoretical developments, are more
related to the multimodality of $p(y)$ and suffice for our purposes.
Therefore, in the following theoretical developments, the
pair of bounds of Lemma 1 shall be used.

IV. Mixing local minima in multimodal BSS 

Based on the results derived in Section III-B, it will be
shown that mixing local minima of the entropy exist in the
context of the blind separation of multimodal sources with
Gaussian modes if the mode standard deviations $\sigma_n$ are small
enough.

We are interested in the (mixing) local minima of $H(\mathbf{w}\mathbf{S})$
on the unit sphere $\mathcal{S} = \{\mathbf{w} : \|\mathbf{w}\| = 1\}$ of $\mathbb{R}^K$. We
shall assume that the sources have a pdf of the form (5),
with $K_n$ being Gaussian with identical variance $\sigma^2$ (but with
distinct means). Thus, as in Example 4, we may represent
$S_k$ as $U_k + \sigma Z_k$ where $U_k$ is a discrete random variable
and $Z_k$ is a standard Gaussian variable independent from
$U_k$. Further, $(U_1, Z_1), \dots, (U_K, Z_K)$ are assumed to be
independent so that the sources are independent as required.
From this representation, $\mathbf{w}\mathbf{S} = \mathbf{w}\mathbf{U} + \sigma Z$ where $\mathbf{U}$ is
the column vector with components $U_k$ and $Z$ is again a
standard Gaussian variable (since any linear combination of
independent Gaussian variables is a Gaussian variable and
$\sum_{k=1}^{K} w_k Z_k$ has zero mean and unit variance). Since $\mathbf{w}\mathbf{U}$ is
clearly a discrete random variable, $\mathbf{w}\mathbf{S}$ also has a multimodal
distribution of the form (5) with $K_n$ again the Gaussian
density with variance $\sigma^2$. Note that the number of modes is
the number of distinct values $\mathbf{w}\mathbf{U}$ can take and the mode
centers (the means of the $K_n$) are these values; they depend
on $\mathbf{w}$. However, as long as $\sigma$ is small enough with respect
to the distances $d_n$ defined in (13), the approximation (7) of
the entropy is justified. Thus, we are led to the approximation
$H(\mathbf{w}\mathbf{S}) \approx h(\mathbf{w}\mathbf{U}) + \log\sigma + H(\Phi)$, where $h(\mathbf{w}\mathbf{U})$ denotes,
with a slight abuse of notation, the entropy of the discrete random
variable $\mathbf{w}\mathbf{U}$ (the entropy of a discrete random variable $U$ with
probability vector $\pi$ is denoted by either $h(U)$ or $h(\pi)$).

The above approximation suggests that there is a relationship
between the local minimum points of $H(\mathbf{w}\mathbf{S})$ and those of
$h(\mathbf{w}\mathbf{U})$. Therefore, we shall first focus on the local minimum
points of the entropy of $\mathbf{w}\mathbf{U}$ before analyzing those of $H(\mathbf{w}\mathbf{S})$.

A. Local minimum points of h(wU) 

The function $h(\mathbf{w}\mathbf{U})$ does not depend on the values that
$\mathbf{w}\mathbf{U}$ can take but only on the associated probabilities; these
probabilities remain constant as $\mathbf{w}$ changes unless the number
of distinct values that $\mathbf{w}\mathbf{U}$ can take varies. This number
decreases when an equality $\mathbf{w}\mathbf{u} = \mathbf{w}\mathbf{u}'$ is attained for some
distinct column vectors $\mathbf{u}$ and $\mathbf{u}'$ in the set of possible values
of $\mathbf{U}$. A deeper analysis yields the following result, which is
helpful to find the local minimum points of $h(\mathbf{w}\mathbf{U})$.

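Lemma 2: Let $\mathbf{U}$ be a discrete random vector in $\mathbb{R}^K$ and
$\mathcal{U}$ be the set of distinct values it can take. Assume that there
exist $r \ge 1$ disjoint subsets $\mathcal{U}_1, \dots, \mathcal{U}_r$ of $\mathcal{U}$, each containing
at least 2 elements, such that the linear subspace $V$ spanned
by the vectors $\mathbf{u} - \mathbf{u}_1$, $\mathbf{u} \in \mathcal{U}_1 \setminus \{\mathbf{u}_1\}$, $\dots$, $\mathbf{u} - \mathbf{u}_r$,
$\mathbf{u} \in \mathcal{U}_r \setminus \{\mathbf{u}_r\}$, $\mathbf{u}_1, \dots, \mathbf{u}_r$ being arbitrary elements of
$\mathcal{U}_1, \dots, \mathcal{U}_r$, is of dimension $K - 1$. (Note that $V$ does not depend on the
choice of $\mathbf{u}_1, \dots, \mathbf{u}_r$, since $\mathbf{u} - \mathbf{u}'_j = (\mathbf{u} - \mathbf{u}_j) - (\mathbf{u}'_j - \mathbf{u}_j)$ for
any other $\mathbf{u}'_j \in \mathcal{U}_j$.) Then for $\mathbf{w}^* \in \mathcal{S}$ and orthogonal to $V$,
there exists a neighborhood $\mathcal{W}$ of $\mathbf{w}^*$ in $\mathcal{S}$ and $\alpha > 0$ such
that $h(\mathbf{w}\mathbf{U}) \ge h(\mathbf{w}^*\mathbf{U}) + \alpha$ for all $\mathbf{w} \in \mathcal{W} \setminus \{\mathbf{w}^*\}$. In the
case $K = 2$, one has the stronger result that
$h(\mathbf{w}\mathbf{U}) = h(\mathbf{U}) > h(\mathbf{w}^*\mathbf{U})$ for all $\mathbf{w} \in \mathcal{W} \setminus \{\mathbf{w}^*\}$.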

The proof is given in Appendix II. 

Example 5: An illustration of Lemma 2 in the $K = 2$
case (again for clarity) is provided in Fig. 3. We note
$\mathbf{U} = [U_1, U_2]^T$ where the discrete variables $U_1$ and $U_2$ take the
values $-\sqrt{1.03} + 2.5$, $\sqrt{1.03} + 2.5$ with probabilities .5, .5
and the values $-1.2$, $-.4$, $2$ with probabilities $1/2, 3/8, 1/8$,
respectively. They are chosen to have the same variance, as
we need that the $S_k = U_k + \sigma Z_k$, $k = 1, 2$, have the same
variance. But their means can be arbitrary since $H(\mathbf{w}\mathbf{S})$ does
not depend on them. In this $K = 2$ example, each line that
links two distinct points $\mathbf{u}, \mathbf{u}' \in \mathcal{U}$ spans a one-dimensional
linear subspace, which constitutes a possible subspace $V$, as
stated in Lemma 2. There are thus many possibilities for $V$,
each corresponding to a specific vector $\mathbf{w}^*$.

Fig. 3. Example 5: illustration of Lemma 2. The discrete random variables
$U_1$ and $U_2$ take values in $\{-\sqrt{1.03} + 2.5, \sqrt{1.03} + 2.5\}$ and $\{-1.2, -.4, 2\}$
with probabilities $[.5, .5]$ and $[1/2, 3/8, 1/8]$, respectively.

Two simple possibilities for $V$ are the subspaces with
direction given by $[0, 1]^T$ and $[1, 0]^T$. In the first case, the
subsets $\mathcal{U}_i$ are built by grouping the points of $\mathcal{U}$ lying
on the same vertical dashed line. There are two such subsets
($r = 2$) consisting of $\mathbf{u} \in \mathcal{U}$ with first component equal to
$-\sqrt{1.03} + 2.5$ and $\sqrt{1.03} + 2.5$, respectively. In the second
case, the subsets $\mathcal{U}_i$ are built by grouping the points of $\mathcal{U}$
lying on the same horizontal dashed line. There are three such
subsets ($r = 3$) consisting of $\mathbf{u} \in \mathcal{U}$ with second component
equal to $-1.2$, $-.4$ and $2$, respectively.

There also exist other subspaces $V$, corresponding to "diagonal
lines" (i.e. to solid lines in Fig. 3). This last kind of
one-dimensional linear subspace $V$ corresponds to directions given
by two-dimensional vectors $\mathbf{w}^*$ with two non-zero elements.

On the plot, the points on the half circle correspond to the
vectors $\mathbf{w}^*$ of the Lemma; each $\mathbf{w}^*$ is orthogonal to a line
joining a pair of distinct points in $\mathcal{U}$, $\mathcal{U}$ being the set of all
possible values of $[U_1, U_2]^T$. The points of $\mathcal{U}$ are displayed in
the plot together with their probabilities. The entropies $h(\mathbf{w}\mathbf{U})$
are also given in the plot; one can see that they are lower for
$\mathbf{w} = \mathbf{w}^*$ than for other points $\mathbf{w}$.

The above Lemma only provides a means to find a local
minimum point of the function $h(\mathbf{w}\mathbf{U})$, but does not prove
the existence of such a point, since the existence of $V$ was
only assumed in the Lemma. Nevertheless, in the case where
the components of $\mathbf{U}$ are independent and can each take at least
2 distinct values, subsets $\mathcal{U}_i$ ensuring the existence of $V$ can
be built as follows. Let $j$ be any index in $\{1, \dots, K\}$ and
$\lambda_{j,1}, \dots, \lambda_{j,r_j}$ be the possible values of $U_j$, the $j$-th component
of $\mathbf{U}$. One can take $\mathcal{U}_i$, $1 \le i \le r_j$, to be the set of $\mathbf{u} \in \mathcal{U}$
whose $j$-th component equals $\lambda_{j,i}$. Then it is clear that the
corresponding subspace $V$ consists of all vectors orthogonal
to the $j$-th row of the identity matrix (hence $V$ is of dimension
$K - 1$) and that the associated vector $\mathbf{w}^*$ is simply this row
or its opposite. By Lemma 2, this point $\mathbf{w}^*$ is a local
minimum point of $h(\mathbf{w}\mathbf{U})$. But, as explained above, it is a
non-mixing point while we are interested in mixing points, i.e.
points not proportional to a row of the identity matrix. However, the
above construction can be extended by looking for a set of $K$
vectors $\mathbf{u}_1, \dots, \mathbf{u}_K$ in $\mathcal{U}$, such that the vectors $\mathbf{u}_i - \mathbf{u}_j$,
$1 \le i < j \le K$, span a linear subspace of dimension $K - 1$ of $\mathbb{R}^K$. If
such a set can be found, then $V$ is simply this linear subspace,
by taking $\mathcal{U}_1 = \{\mathbf{u}_1, \dots, \mathbf{u}_K\}$ and $r = 1$. In addition, if, for every $j$,
$\mathbf{u}_1, \dots, \mathbf{u}_K$ do not all have the same $j$-th component,
then the corresponding $\mathbf{w}^*$ is a mixing local minimum point.
In view of the fact that there are at least $2^K$ points in $\mathcal{U}$ to
choose from for the $\mathbf{u}_i$ and that the last construction procedure
may not find all local minimum points of $h(\mathbf{w}\mathbf{U})$, it is likely
that there exist both non-mixing and mixing local minimum
points of $h(\mathbf{w}\mathbf{U})$. In the $K = 2$ case this is indeed so:
it suffices to take two distinct points $\mathbf{u}_1$ and $\mathbf{u}_2$ in $\mathcal{U}$; then, by
the above Lemma, the vector $\mathbf{w}^*$ orthogonal to $\mathbf{u}_1 - \mathbf{u}_2$ is a
local minimum point of $h(\mathbf{w}\mathbf{U})$. If one chooses $\mathbf{u}_1$ and $\mathbf{u}_2$ such
that both components of $\mathbf{u}_1 - \mathbf{u}_2$ are non-zero, the associated
orthogonal vector $\mathbf{w}^*$ is not proportional to any row of the
identity matrix; it is a mixing local minimum point of $h(\mathbf{w}\mathbf{U})$.
Note that in the particular $K = 2$ case, the aforementioned
method identifies all local minimum points of $h(\mathbf{w}\mathbf{U})$. Indeed,
for any $\mathbf{w} \in \mathcal{S}$, either there exists a pair of distinct vectors
$\mathbf{u}_1, \mathbf{u}_2$ in $\mathcal{U}$ such that $\mathbf{w}(\mathbf{u}_1 - \mathbf{u}_2) = 0$ or there exists no such
pair. In the first case $\mathbf{w}$ is a local minimum point and in the
second case one has $h(\mathbf{w}\mathbf{U}) = h(\mathbf{U})$. Since there is only a
finite number of differences $\mathbf{u}_1 - \mathbf{u}_2$, for distinct $\mathbf{u}_1, \mathbf{u}_2$ in
$\mathcal{U}$, there can be only a finite number of local minimum points
of $h(\mathbf{w}\mathbf{U})$, and for all other points $h(\mathbf{w}\mathbf{U})$ takes the maximum
value $h(\mathbf{U})$.
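The $K = 2$ construction described above can be sketched in Python using the values of Example 5. For every pair of distinct points of the support, the code builds the unit vector orthogonal to their difference and evaluates the entropy of the projected discrete variable. The helper name `projected_entropy`, the merging tolerance and the generic comparison direction are assumptions introduced for this illustration.

```python
import itertools
import math
import numpy as np

def projected_entropy(w, points, probs, tol=1e-9):
    """Entropy h(wU) of the discrete variable wU; equal projections are merged."""
    proj = points @ w
    order = np.argsort(proj)
    proj, p = proj[order], probs[order]
    merged = [p[0]]
    for i in range(1, len(proj)):
        if proj[i] - proj[i - 1] < tol:
            merged[-1] += p[i]          # same value of wU: probabilities add up
        else:
            merged.append(p[i])
    merged = np.array(merged)
    return float(-np.sum(merged * np.log(merged)))

# Values and probabilities of U_1 and U_2 taken from Example 5.
u1_vals = np.array([2.5 - math.sqrt(1.03), 2.5 + math.sqrt(1.03)])
u1_prob = np.array([0.5, 0.5])
u2_vals = np.array([-1.2, -0.4, 2.0])
u2_prob = np.array([1 / 2, 3 / 8, 1 / 8])

# Support of U = [U_1, U_2]^T and joint probabilities (independent components).
points = np.array([[a, b] for a in u1_vals for b in u2_vals])
probs = np.array([p * q for p in u1_prob for q in u2_prob])

# Generic direction (assumed): no two support points share the same projection.
w_generic = np.array([math.sin(0.3), math.cos(0.3)])
h_U = projected_entropy(w_generic, points, probs)

minima = []
for u, v in itertools.combinations(points, 2):
    diff = u - v
    w_star = np.array([-diff[1], diff[0]]) / np.linalg.norm(diff)  # orthogonal to u - v
    is_mixing = bool(abs(w_star[0]) > 1e-9 and abs(w_star[1]) > 1e-9)
    minima.append((w_star, projected_entropy(w_star, points, probs), is_mixing))
```

Each entry of `minima` holds a candidate $\mathbf{w}^*$, the corresponding entropy and a flag telling whether $\mathbf{w}^*$ has two non-zero components. In this example the pairs lying on a common vertical or horizontal line give non-mixing directions, while the remaining pairs give mixing ones.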

B. Local minimum points of $H(\mathbf{w}\mathbf{S})$

This subsection shows that the local minimum points of
$H(\mathbf{w}\mathbf{S})$ can be related to those of $h(\mathbf{w}\mathbf{U})$.

Lemma 3: Define $S_i$, $i = 1, \ldots, K$, as $S_i = U_i + \sigma Z_i$,
as described at the beginning of Section IV, and let $\mathbf{w}^*$ be a vector
satisfying the assumption of Lemma 2 ($\mathbf{U}$ being the vector
with components $U_i$). Then, for $\sigma$ sufficiently small,
$H(\mathbf{w}\mathbf{S})$ admits a local minimum point converging to
$\mathbf{w}^*$ as $\sigma \to 0$.

The proof of this Lemma is relegated to the Appendix. 

Example 6: Thanks to the entropy approximator, we shall
illustrate the existence of the local minima of $H(\mathbf{w}\mathbf{S})$ in
the following $K = 2$ example, so that vectors $\mathbf{w}$ satisfying
$\|\mathbf{w}\| = 1$ can be written as $[\sin\theta, \cos\theta]$. We take
$S_1 = U_{\pi/2} + \sigma Z_1$ and $S_2 = U_0 + \sigma Z_2$, where $U_0$,
$U_{\pi/2}$ are independent discrete random variables taking the values
$-2\sqrt{3}/3$, $\sqrt{3}/2$ with probabilities $1/3$, $2/3$ and
$-\sqrt{2}$, $\sqrt{2}/2$ with probabilities $3/7$, $4/7$,
respectively, and $Z_1$, $Z_2$ are standard Gaussian variables. The
parameter $\sigma$ is set to 0.1. Thus $Y_\theta = \mathbf{w}\mathbf{S}$ can be
represented as $U_\theta + \sigma Z$, where
$U_\theta = \sin\theta\, U_{\pi/2} + \cos\theta\, U_0$ and $Z$ is a standard
Gaussian variable independent from $U_\theta$. Figure 4 plots the pdf
of $Y_\theta$ for various angles $\theta$. It can be seen that the modality
(i.e. the number of modes) changes with $\theta$. Fig. 5 shows the entropy
of $Y_\theta$ together with its upper and lower bounds, for
$\theta \in [0, \pi)$. In addition to non-mixing local minima at
$\theta \in \{p\pi/2 \mid p \in \mathbb{Z}\}$, mixing local minima exist when
$\mathbf{w}(\mathbf{u}_1 - \mathbf{u}_2) = 0$, where
$\mathbf{u}_1 = [-2\sqrt{3}/3, \sqrt{2}/2]^T$,
$\mathbf{u}_2 = [\sqrt{3}/2, -\sqrt{2}]^T$, i.e. when
$|\tan\theta| = 0.9526$, or
$\theta \in \{(0.2423 + p)\pi, (0.7577 + p)\pi \mid p \in \mathbb{Z}\}$.
One can observe that the upper bound is a constant
function except for a finite number of angles for which we
observe negative peaks (see Lemma 2). For these angles the
pdf is strongly multimodal, and the upper and lower bounds
are very close, though this is not clearly visible on the figure. The
lack of visibility results from a discontinuity of the lower bound at these
angles, due to the superimposition of several modes.

Fig. 4. Example 6: probability density function of $\mathbf{w}\mathbf{S}$ for various angles $\theta$.
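The behaviour described in Example 6 can be reproduced with a short numerical sketch. This is an illustration under stated assumptions rather than the authors' procedure: the values and probabilities are copied from the example as printed (with the pairing read through the word "respectively"), the entropy is estimated by a plain midpoint Riemann sum, and the helper names `atoms`, `entropy_riemann` and `upper_bound` are hypothetical. The upper bound is the quantity $h(\mathbf{w}\mathbf{U}) + H(\Phi) + \log\sigma$ used in the proof of Lemma 3.

```python
import math

SQ2, SQ3 = math.sqrt(2.0), math.sqrt(3.0)
# (value, probability) pairs as printed in Example 6 (pairing is an assumption).
U0 = [(-2 * SQ3 / 3, 1 / 3), (SQ3 / 2, 2 / 3)]   # multiplied by cos(theta)
UPI2 = [(-SQ2, 3 / 7), (SQ2 / 2, 4 / 7)]         # multiplied by sin(theta)
SIGMA = 0.1

def atoms(theta, tol=1e-9):
    """Atoms [value, probability] of U_theta = sin(theta) U_{pi/2} + cos(theta) U_0."""
    merged = []
    for a, pa in UPI2:
        for b, pb in U0:
            mu = math.sin(theta) * a + math.cos(theta) * b
            for atom in merged:
                if abs(atom[0] - mu) < tol:   # coinciding modes are merged
                    atom[1] += pa * pb
                    break
            else:
                merged.append([mu, pa * pb])
    return merged

def entropy_riemann(theta, sigma=SIGMA, n=4000):
    """Midpoint Riemann-sum estimate of H(Y_theta), with Y_theta = U_theta + sigma Z."""
    mix = atoms(theta)
    lo = min(m for m, _ in mix) - 8 * sigma
    hi = max(m for m, _ in mix) + 8 * sigma
    dy = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * dy
        p = sum(w * math.exp(-0.5 * ((y - m) / sigma) ** 2) for m, w in mix)
        p /= sigma * math.sqrt(2 * math.pi)
        if p > 0.0:
            total -= p * math.log(p) * dy
    return total

def upper_bound(theta, sigma=SIGMA):
    """h(U_theta) + H(Phi) + log(sigma), with H(Phi) = log sqrt(2 pi e)."""
    h = -sum(w * math.log(w) for _, w in atoms(theta))
    return h + 0.5 * math.log(2 * math.pi * math.e) + math.log(sigma)

# Mixing angle solving |tan(theta)| = 0.9526 for the values above.
theta_star = math.atan(7 * SQ3 / (9 * SQ2))

for t in (theta_star, theta_star + 0.2):
    print(f"theta/pi = {t / math.pi:.4f}   H = {entropy_riemann(t):.4f}"
          f"   upper bound = {upper_bound(t):.4f}")
```

At `theta_star` two of the four modes coincide, so the discrete entropy term (and hence the upper bound) drops, and the estimated entropy is markedly lower than at a nearby generic angle.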

V. Complementary observations 

This section provides two observations regarding
the impact of the mode variance $\sigma^2$ on the existence
of local minima and the symmetry of the entropy with respect
to $\theta$.

A. Impact of "mode variance" $\sigma^2$

Fig. 5. Example 6: upper bound (dashed line), lower bound (dots) and entropy
estimate of $Y_\theta$ using a finite Riemann sum (solid). It can be seen that the
upper and lower bounds of the entropy converge to each other when the
density becomes strongly multimodal (see the corresponding plots in Fig. 4).

Fig. 6. Entropy of $\mathbf{w}\mathbf{S}$ (estimated using a finite Riemann sum) versus $\theta$ for
$S_1 = U_1 + \sigma Z_1$, $S_2 = U_2 + \sigma Z_2$, where $U_1$ and $U_2$ are taken from Example 5
(and Fig. 3) and the four random variables are all independent. The parameter
$\sigma$ is set to .05 (solid), .25 (dashed-dotted) and .5 (dotted). The upper and lower
bounds have been added for the $\sigma = .05$ case only, for visibility purposes. It
can be seen that the upper and lower bounds of the entropy converge to each
other when the density becomes strongly multimodal.

In the example of Fig. 6, the discrete variables $U_1$ and $U_2$
in the expression of $S_1$ and $S_2$ are taken as in Example 3.
One can observe that the mixing minima of the entropy
tend to disappear when the mode variance increases. This
is a direct consequence of the fact that the mode overlaps
increase. When $\sigma$ increases, the source densities become more
and more Gaussian and the $H(\mathbf{w}\mathbf{S})$ vs $\theta$ curve tends to
be more and more flat, approaching the constant function
$\log\sqrt{2\pi e} + \log\sigma$. The upper and lower bounds have only
been plotted for the $\sigma = .05$ case, for visibility purposes. Again, at
angles corresponding to the upper bound negative peaks, the
error bound is very tight, as explained in Example 6.
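The flattening for large $\sigma$ can be made quantitative with a short worked bound. It is not taken from the text and assumes only that $Z$ is a standard Gaussian variable independent of $\mathbf{w}\mathbf{U}$ and that $\mathbf{w}\mathbf{U}$ has a finite variance, written $v$ here (a notation introduced for this remark). Conditioning on $\mathbf{U}$ cannot increase the entropy, which gives the left inequality; the Gaussian density maximizes the entropy for a given variance, which gives the right one:

$$\log\sqrt{2\pi e} + \log\sigma \;=\; H(\sigma Z) \;\le\; H(\mathbf{w}\mathbf{U} + \sigma Z) \;\le\; \tfrac{1}{2}\log\big[2\pi e\,(\sigma^2 + v)\big] \;=\; \log\sqrt{2\pi e} + \log\sigma + \tfrac{1}{2}\log\Big(1 + \frac{v}{\sigma^2}\Big).$$

The gap $\tfrac{1}{2}\log(1 + v/\sigma^2)$ between the two sides vanishes as $\sigma$ grows, whatever the angle, which is consistent with the curve approaching the constant $\log\sqrt{2\pi e} + \log\sigma$.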

B. Note on symmetry of $H(\mathbf{w}_\theta\mathbf{S})$

In the above graphs plotting the entropy (and its bounds)
versus $\theta$, some symmetry can be observed. First, if we write
$\mathbf{w}_\theta = [\sin\theta, \cos\theta]$, observe that
$H(\mathbf{w}_\theta\mathbf{S}) = H(\mathbf{w}_{\theta+\pi}\mathbf{S})$
whatever the source pdfs are; this is a direct consequence
of the fact that the entropy is not sign sensitive. Second,
if one of the source densities is symmetric, i.e. if there exists
$\mu \in \mathbb{R}$ such that $p_{S_i}(\mu - s) = p_{S_i}(\mu + s)$ for all
$s \in \mathbb{R}$, then
$H(\mathbf{w}_\theta\mathbf{S}) = H(\mathbf{w}_{-\theta}\mathbf{S})$. Third,
if the two sources share the same pdf, then
$H(\mathbf{w}_\theta\mathbf{S}) = H(\mathbf{w}_{\pi/2-\theta}\mathbf{S})$.
Finally, if the two sources can be expressed as in Lemma 3, then the vectors
$\mathbf{w}^*$ for which $h(\mathbf{w}^*\mathbf{U}) < h(\mathbf{U})$ (as
obtained in Lemma 2) are symmetric in the sense that their angles are pairwise
opposite. This means that for $\sigma$ small enough, if a local minimum of
$H(\mathbf{w}_\theta\mathbf{S})$ appears at $\theta^*$, then another local
minimum point will exist near $-\theta^*$ (and thus near $p\pi - \theta^*$,
$\forall p \in \mathbb{Z}$). The above symmetry property can be seen from
Figure 3 and can be proved formally as follows. From Lemma 2, $\mathbf{w}^*$
must be orthogonal to $\mathbf{u}_1 - \mathbf{u}_2$ for some pair of distinct
vectors in the set of all possible values of $\mathbf{U}$. Define
$\mathbf{u}'_i$ ($i = 1, 2$) to be the vector with first coordinate the same
as that of $\mathbf{u}_{3-i}$ and second coordinate the same as that of
$\mathbf{u}_i$. Then it can be seen that the vector orthogonal to
$\mathbf{u}'_1 - \mathbf{u}'_2$ has an angle opposite to the angle of
$\mathbf{w}^*$, yielding the desired result.
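The last argument can be checked on a small example. The sketch below is illustrative only: the two support points are hypothetical and the helpers `orthogonal_angle` and `swapped_pair` are names introduced here. It builds the pair $\mathbf{u}'_1, \mathbf{u}'_2$ by exchanging first coordinates and compares the two orthogonal directions.

```python
import math

def orthogonal_angle(u, v):
    """Angle theta in (-pi/2, pi/2] such that [sin(theta), cos(theta)] is orthogonal to u - v."""
    d0, d1 = u[0] - v[0], u[1] - v[1]
    theta = math.atan2(-d1, d0)   # sin(theta)*d0 + cos(theta)*d1 = 0
    # Fold into (-pi/2, pi/2]: w and -w give the same entropy.
    if theta > math.pi / 2:
        theta -= math.pi
    elif theta <= -math.pi / 2:
        theta += math.pi
    return theta

def swapped_pair(u1, u2):
    """u'_i takes its first coordinate from u_{3-i} and its second from u_i."""
    return (u2[0], u1[1]), (u1[0], u2[1])

# Hypothetical pair of distinct support points with both coordinates differing.
u1, u2 = (-1.0, 0.3), (0.5, -2.0)
theta = orthogonal_angle(u1, u2)
theta_swapped = orthogonal_angle(*swapped_pair(u1, u2))
print(f"theta* = {theta:.4f}, angle for the swapped pair = {theta_swapped:.4f}")
```

Exchanging the first coordinates flips the sign of the first component of the difference while leaving the second unchanged, which is why the two angles are opposite.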

VI. Conclusion 

In this paper, new results regarding both non-mixing and
mixing entropy local minima have been derived in the context
of the blind separation of $K$ sources. First, it is shown
that a local entropy minimum exists when the output is
proportional to one of the non-Gaussian sources. Second, it
is shown that mixing entropy minima may exist when the
source densities are strongly multimodal (i.e. multimodal with
sufficiently small overlap); therefore, spurious BSS solutions
can be obtained when minimizing this entropic criterion.
Attention must therefore be paid to the solutions obtained
by adaptive gradient minimization.

To prove the existence of mixing entropy minima, a theoretical
framework using an entropy approximator and its associated
error bounds has been provided. Even if this approximator
is considered here in the context of blind source separation,
its use can be extended to other applications involving entropy
estimation.

Appendix I
Proofs of Lemmas

Proof of Lemma 1: We have from the mixture form
$p(y) = \sum_{n=1}^N \pi_n K_n(y)$ of the density that
$H(Y) = \sum_{n=1}^N \pi_n H_n$, where

$$H_n = -\int K_n(y) \log\Big[\sum_{m=1}^N \pi_m K_m(y)\Big]\, dy. \tag{16}$$

Since all $K_m \ge 0$, the last right hand side is bounded above
by $-\int K_n(y) \log[\pi_n K_n(y)]\, dy = H(K_n) - \log\pi_n$, yielding
the inequality (8).

A more elegant derivation of this inequality can be obtained
from the entropy properties. Indeed, the mixture density $p$
can be interpreted as the marginal density of an augmented
model $(Y, U)$ where $U$ is a discrete variable with $N$ values
$u_1, \ldots, u_N$ with probabilities $\pi_1, \ldots, \pi_N$ and $Y$ has a
conditional density given $U = u_n$ equal to $K_n$. The joint
entropy $H(Y, U)$ of (the "continuous-discrete" pair of random
variables) $Y, U$ equals $H(Y|U) + h(U)$, where $h(U) = h(\pi)$
is the discrete entropy of $U$ and $H(Y|U) = \sum_{n=1}^N \pi_n H(K_n)$
is the conditional entropy of $Y$ given $U$. But $H(Y, U) =
h(U|Y) + H(Y)$ (where $h(U|Y)$ is the conditional entropy
of $U$ given $Y$) and thus $\bar{H}(p) - H(p)$, where
$\bar{H}(p) = \sum_{n=1}^N \pi_n H(K_n) + h(\pi)$ denotes the upper bound,
equals $h(U|Y)$, which is always nonnegative because $U$ is a discrete variable.
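The identity between the gap of the upper bound and $h(U|Y)$ can be verified numerically. The following sketch makes several assumptions that are not in the text: the kernels are Gaussian with a common standard deviation `s`, the weights and means are arbitrary illustrative values, and both integrals are approximated by a midpoint Riemann sum. The function name `entropy_gap_terms` is hypothetical.

```python
import math

def gauss(y, mu, s):
    """Gaussian density with mean mu and standard deviation s."""
    return math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def entropy_gap_terms(pis, mus, s, n=20000):
    """Return (upper bound minus H(Y), h(U|Y)) for a Gaussian mixture."""
    lo, hi = min(mus) - 10 * s, max(mus) + 10 * s
    dy = (hi - lo) / n
    H_Y = 0.0            # differential entropy of the mixture
    h_U_given_Y = 0.0    # conditional entropy of the label given Y
    for k in range(n):
        y = lo + (k + 0.5) * dy
        joint = [pi * gauss(y, mu, s) for pi, mu in zip(pis, mus)]
        p = sum(joint)
        if p <= 0.0:
            continue
        H_Y -= p * math.log(p) * dy
        # Posterior P(U = u_n | Y = y) is joint[n] / p.
        h_U_given_Y -= sum(j * math.log(j / p) for j in joint if j > 0.0) * dy
    H_kernel = 0.5 * math.log(2 * math.pi * math.e) + math.log(s)  # H(K_n), same for all n
    h_pi = -sum(pi * math.log(pi) for pi in pis)
    return H_kernel + h_pi - H_Y, h_U_given_Y

# Illustrative, strongly overlapping two-component mixture (assumed values).
gap, cond = entropy_gap_terms([0.3, 0.7], [0.0, 0.25], 0.2)
print(f"upper bound - H(Y) = {gap:.6f},  h(U|Y) = {cond:.6f}")
```

The two printed quantities agree up to quadrature error, and both lie between zero and $h(\pi)$, as expected for the conditional entropy of a discrete variable.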



Yet another way to prove the above inequality is to exploit
its connection to the decision problem discussed in Section III-C.
Indeed, equation (13) yields immediately
$\bar{H}(p) - H(p) \ge P(e) \ge 0$.

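To prove the second result, noting that $\log(1 + x) \le x$, the
term $\log[\sum_{m=1}^N \pi_m K_m(y)]$ can be bounded above by

$$\begin{cases}
\log[\pi_n K_n(y)] + \displaystyle\sum_{1 \le m \le N,\, m \ne n} \frac{\pi_m K_m(y)}{\pi_n K_n(y)} & \text{if } y \in \Omega_n, \\[2ex]
\log\big(\max_{1 \le m \le N} \sup K_m\big) & \text{otherwise.}
\end{cases}$$

Therefore, with

$$-H_n = \int K_n(y) \log\Big[\sum_{m=1}^N \pi_m K_m(y)\Big]\, dy, \tag{18}$$

one gets

$$-H_n \le \int_{\Omega_n} K_n(y) \log[\pi_n K_n(y)]\, dy
+ \sum_{1 \le m \le N,\, m \ne n} \frac{\pi_m}{\pi_n} \int_{\Omega_n} K_m(y)\, dy
+ \log\Big(\max_{1 \le m \le N} \sup K_m\Big)\, \epsilon_n.$$

But since $\Omega_1, \ldots, \Omega_N$ are disjoint,

$$\sum_{n=1}^N \pi_n \sum_{1 \le m \le N,\, m \ne n} \frac{\pi_m}{\pi_n} \int_{\Omega_n} K_m(y)\, dy
= \sum_{m=1}^N \pi_m \int_{\cup_{1 \le n \le N,\, n \ne m} \Omega_n} K_m(y)\, dy,$$

and $\cup_{1 \le n \le N,\, n \ne m} \Omega_n \subseteq \mathbb{R} \setminus \Omega_m$.
Therefore the right hand side of the above equality is bounded above by
$\sum_{m=1}^N \pi_m \epsilon_m$. It follows that
$H(p) = \sum_{n=1}^N \pi_n H_n$ is bounded below by

$$h(\pi) + \sum_{n=1}^N \pi_n H(K_n) + \sum_{n=1}^N \pi_n \log(\pi_n \sup K_n)\, \epsilon_n - \sum_{n=1}^N \pi_n \epsilon'_n
- \sum_{m=1}^N \pi_m \epsilon_m - \sum_{n=1}^N \pi_n \log\Big(\max_{1 \le m \le N} \sup K_m\Big)\, \epsilon_n.$$

After some manipulations, the above expression reduces to
the lower bound for $\sum_{n=1}^N \pi_n H_n$ given in the Lemma. ∎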

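Proof of Lemma 2:

By construction, for each $j = 1, \ldots, r$, $\mathbf{w}^*\mathbf{u}$ takes
the same value for all $\mathbf{u} \in \mathcal{U}_j$. On the other hand, by
grouping the vectors $\mathbf{u} \in \mathcal{U}$ which produce the same value
of $\mathbf{w}^*\mathbf{u}$ into subsets of $\mathcal{U}$, one gets a
partition of $\mathcal{U}$ into $r^* + 1$ subsets
$\mathcal{U}^*_0, \ldots, \mathcal{U}^*_{r^*}$, such that each
$\mathcal{U}^*_j$, $1 \le j \le r^*$, contains at least two elements,
$\mathbf{w}^*\mathbf{u}$ takes the same value for all
$\mathbf{u} \in \mathcal{U}^*_j$, and the values
associated with different $\mathcal{U}^*_j$ and the $\mathbf{w}^*\mathbf{u}$,
$\mathbf{u} \in \mathcal{U}^*_0$, are all distinct. Obviously $r^* \ge 1$ and
each of the $\mathcal{U}_1, \ldots, \mathcal{U}_r$ must
be contained in one of the $\mathcal{U}^*_1, \ldots, \mathcal{U}^*_{r^*}$.
Therefore the space $V$ must be contained in the space spanned by the vectors
$\mathbf{u} - \mathbf{u}_j$, $\mathbf{u} \in \mathcal{U}^*_j \setminus \{\mathbf{u}_j\}$,
$j = 1, \ldots, r^*$, where $\mathbf{u}_1, \ldots, \mathbf{u}_{r^*}$ are arbitrary
elements of $\mathcal{U}^*_1, \ldots, \mathcal{U}^*_{r^*}$. But the last space
is orthogonal to $\mathbf{w}^*$ by construction and thus cannot have dimension
greater than $K - 1$, hence it must coincide with $V$.

Putting $P(\mathbf{u})$ for $P(\mathbf{U} = \mathbf{u})$ for short and
$P(\mathcal{U}^*_j) = \sum_{\mathbf{u} \in \mathcal{U}^*_j} P(\mathbf{u})$, one has

$$h(\mathbf{w}^*\mathbf{U}) = -\sum_{\mathbf{u} \in \mathcal{U}^*_0} P(\mathbf{u}) \log P(\mathbf{u}) - \sum_{j=1}^{r^*} P(\mathcal{U}^*_j) \log P(\mathcal{U}^*_j).$$

For a given pair $\mathbf{u}, \mathbf{u}'$ of distinct vectors in
$\mathcal{U}$, if $\mathbf{w}^*(\mathbf{u} - \mathbf{u}') \ne 0$ then it
remains so when $\mathbf{w}^*$ is changed to $\mathbf{w}$, provided
that the change is sufficiently small. But if
$\mathbf{w}^*(\mathbf{u} - \mathbf{u}') = 0$ then this equality may break
however small the change. In fact, if $\mathbf{w}$ is not proportional to
$\mathbf{w}^*$, it is not orthogonal to $V$, hence
$\mathbf{w}(\mathbf{u} - \mathbf{u}') \ne 0$ for at least one pair
$\mathbf{u}, \mathbf{u}'$ of distinct points in some $\mathcal{U}^*_j$,
meaning that $\mathbf{w}\mathbf{u}$ takes at least two distinct values in
$\mathcal{U}^*_j$. Thus there exists a neighborhood $\mathcal{W}$ of
$\mathbf{w}^*$ in $\mathcal{S}$ such that for all
$\mathbf{w} \in \mathcal{W} \setminus \{\mathbf{w}^*\}$, each subset
$\mathcal{U}^*_j$ can be partitioned into subsets $\mathcal{U}_{j,k}(\mathbf{w})$,
$k = 1, \ldots, n_j(\mathbf{w})$ ($n_j(\mathbf{w})$ can be 1), such that
$\mathbf{w}\mathbf{u}$ takes the same value on $\mathcal{U}_{j,k}(\mathbf{w})$,
and the values of $\mathbf{w}\mathbf{u}$ on the subsets
$\mathcal{U}_{j,k}(\mathbf{w})$ and on each point of $\mathcal{U}^*_0$ are
distinct. Further, there exists at least one index $i$ for which
$n_i(\mathbf{w}) > 1$. For such an index

$$P(\mathcal{U}^*_i) \log P(\mathcal{U}^*_i) = \sum_{k=1}^{n_i(\mathbf{w})} P[\mathcal{U}_{i,k}(\mathbf{w})] \log P[\mathcal{U}_{i,k}(\mathbf{w})]
+ \sum_{k=1}^{n_i(\mathbf{w})} P[\mathcal{U}_{i,k}(\mathbf{w})] \log \frac{P(\mathcal{U}^*_i)}{P[\mathcal{U}_{i,k}(\mathbf{w})]}.$$

The last term can be seen to be a strictly positive number, as
$P(\mathcal{U}^*_i) > P[\mathcal{U}_{i,k}(\mathbf{w})]$ for
$1 \le k \le n_i(\mathbf{w})$. Note that this term
does not depend directly on $\mathbf{w}$ but only indirectly via the sets
$\mathcal{U}_{j,k}(\mathbf{w})$, $k = 1, \ldots, n_j(\mathbf{w})$,
$j = 1, \ldots, r^*$, and there is only
a finite number of possible such sets. Therefore
$h(\mathbf{w}\mathbf{U}) \ge h(\mathbf{w}^*\mathbf{U}) + \alpha$ for some
$\alpha > 0$, for all $\mathbf{w} \in \mathcal{W} \setminus \{\mathbf{w}^*\}$.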

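In the case $K = 2$, the space $V$ reduces to a line and thus
the differences $\mathbf{u} - \mathbf{u}'$ for distinct
$\mathbf{u}, \mathbf{u}'$ in $\mathcal{U}^*_j$, for all $j$, are
collinear with this line. Thus if $\mathbf{w}$ is not proportional to
$\mathbf{w}^*$, hence not orthogonal to this line, $\mathbf{w}\mathbf{u}$
takes distinct values on each of the sets
$\mathcal{U}^*_1, \ldots, \mathcal{U}^*_{r^*}$, and if $\mathbf{w}$ is close
enough to $\mathbf{w}^*$, these values are also distinct for different sets
and distinct from the values of $\mathbf{w}\mathbf{u}$ on $\mathcal{U}^*_0$,
which are distinct themselves. Thus
for such $\mathbf{w}$, $h(\mathbf{w}\mathbf{U}) = h(\mathbf{U})$. ∎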

Proof of Lemma 3: The proof of this Lemma is quite involved
in the $K > 2$ case; therefore, we first give the proof for
the $K = 2$ case, which is much simpler, and then extend it
to $K > 2$. As already shown at the beginning of
Section IV, $\mathbf{w}\mathbf{S} = \mathbf{w}\mathbf{U} + \sigma Z$ where $Z$
is a standard Gaussian variable. Thus, the density of
$\mathbf{w}\mathbf{S}$ is of the mixture form
$p(y) = \sum_{n=1}^N \pi_n K_n(y)$ with
$K_n(y) = \Phi[(y - \mu_n)/\sigma]/\sigma$, $\mu_1, \ldots, \mu_N$ being the
possible values of $\mathbf{w}\mathbf{U}$ and $\Phi$ being the standard
Gaussian density. For $\mathbf{w} = \mathbf{w}^*$, one has by Lemma 1

$$H(\mathbf{w}^*\mathbf{S}) \le h(\mathbf{w}^*\mathbf{U}) + H(\Phi) + \log\sigma.$$

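On the other hand, we have seen in the proof of Lemma 2
that for $\mathbf{w}$ in some neighborhood $\mathcal{W}$ of $\mathbf{w}^*$ and
distinct from $\mathbf{w}^*$, the $\mathbf{w}\mathbf{u}$,
$\mathbf{u} \in \mathcal{U}$ ($\mathcal{U}$ denoting the set of possible values
of $\mathbf{U}$), are all distinct (in the $K = 2$ case). Thus the map
$\mathbf{u} \mapsto \mathbf{w}\mathbf{u}$ sends different points
$\mathbf{u} \in \mathcal{U}$ to different $\mu_n$. However,
when $\mathbf{w}$ approaches $\mathbf{w}^*$, some of the $\mu_n$ tend to
coincide and thus some of the $d_n$ defined in (13) approach zero. To
avoid this we restrict $\mathbf{w}$ to $\mathcal{W} \setminus \mathcal{W}'$,
where $\mathcal{W}'$ is any open neighborhood of $\mathbf{w}^*$ strictly
included in $\mathcal{W}$. Then $\min_n d_n \ge d$ for all
$\mathbf{w} \in \mathcal{W} \setminus \mathcal{W}'$, for some $d > 0$ (which
depends on $\mathcal{W}'$). Thus by Corollary 1, $H(\mathbf{w}\mathbf{S})$ can
be made arbitrarily close to
$h(\mathbf{w}\mathbf{U}) + H(\Phi) + \log\sigma$ for all
$\mathbf{w} \in \mathcal{W} \setminus \mathcal{W}'$ by taking $\sigma$
small enough. But
$h(\mathbf{w}\mathbf{U}) = h(\mathbf{U}) > h(\mathbf{w}^*\mathbf{U})$;
therefore $H(\mathbf{w}\mathbf{S}) > H(\mathbf{w}^*\mathbf{S})$ for all
$\mathbf{w} \in \mathcal{W} \setminus \mathcal{W}'$, for $\sigma$ small enough.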

One can always choose $\mathcal{W}$ to be a closed set in $\mathcal{S}$; hence
it is compact. Since the function
$\mathbf{w} \in \mathcal{W} \mapsto H(\mathbf{w}\mathbf{S})$ is continuous,
it must admit a minimum, which by the above result must be
in $\mathcal{W}'$ and thus is not on the boundary of $\mathcal{W}$. This shows
that this minimum is a local minimum. Finally, as one can choose
$\mathcal{W}'$ arbitrarily small, the above result shows that the above
local minimum converges to $\mathbf{w}^*$ as $\sigma \to 0$.

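Consider now the case $K > 2$. The difficulty is that it is no
longer true that for $\mathbf{w}$ in some neighborhood $\mathcal{W}$ of
$\mathbf{w}^*$ and distinct from $\mathbf{w}^*$, the $\mathbf{w}\mathbf{u}$,
$\mathbf{u} \in \mathcal{U}$, are all distinct. Indeed, by
construction of $\mathbf{w}^*$, there exist $K - 1$ pairs
$(\mathbf{u}_j, \mathbf{u}'_j)$, $1 \le j < K$, of distinct vectors in
$\mathcal{U}$ such that the differences $\mathbf{u}_j - \mathbf{u}'_j$ are
linearly independent and $\mathbf{w}^*(\mathbf{u}_j - \mathbf{u}'_j) = 0$,
$1 \le j < K$. For $\mathbf{w}$ not proportional to $\mathbf{w}^*$, at least
one (but not necessarily all) of the above equalities will break. Therefore
the $\mathbf{w}\mathbf{u}$, $\mathbf{u} \in \mathcal{U}$, may not all be
distinct, even if $\mathbf{w}$ is restricted to
$\mathcal{W} \setminus \mathcal{W}'$. But the set of $\mathbf{w}$ for which
they fail to be all distinct is
the union of a finite number of linear subspaces of dimension
$K - 1$ of $\mathbb{R}^K$ and thus is not dense in $\mathbb{R}^K$. Therefore
for most of the $\mathbf{w} \in \mathcal{W} \setminus \mathcal{W}'$, the
$\mathbf{w}\mathbf{u}$, $\mathbf{u} \in \mathcal{U}$, are all distinct.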

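The pdf of $\mathbf{w}\mathbf{S}$ can be written as

$$p(y) = \sum_{\mathbf{u} \in \mathcal{U}} P(\mathbf{u})\, \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big) \tag{19}$$

but some of the $\mathbf{w}\mathbf{u}$, $\mathbf{u} \in \mathcal{U}$, can be
arbitrarily close to each other. In this case it is of interest to group the
corresponding terms in (19) together. Thus we rewrite $p(y)$ as

$$p(y) = \sum_{n=1}^N \Big[\sum_{\mathbf{u} \in \mathcal{V}_n} P(\mathbf{u})\Big] \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\sum_{\mathbf{u}' \in \mathcal{V}_n} P(\mathbf{u}')}\, \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big)$$

where $\mathcal{V}_1, \ldots, \mathcal{V}_N$ is a partition of $\mathcal{U}$.
This pdf is still of the mixture form $\sum_{n=1}^N \pi_n K_n(y)$, with

$$\pi_n = \sum_{\mathbf{u} \in \mathcal{V}_n} P(\mathbf{u}), \qquad
K_n(y) = \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\pi_n}\, \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big).$$

The partition $\mathcal{V}_1, \ldots, \mathcal{V}_N$ can and should be chosen so that

$$d(\mathbf{w}) = \min_{1 \le n \ne m \le N}\ \min_{\mathbf{u} \in \mathcal{V}_n,\, \mathbf{u}' \in \mathcal{V}_m} |\mathbf{w}\mathbf{u} - \mathbf{w}\mathbf{u}'|,$$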

is bounded below by some given positive number. To this
end, note that, as is shown in the proof of Lemma 2, $\mathbf{w}^*$
is associated with a partition $\mathcal{U}^*_0, \ldots, \mathcal{U}^*_{r^*}$
of $\mathcal{U}$ such that $\mathbf{w}^*\mathbf{u}$
takes the same value for all $\mathbf{u} \in \mathcal{U}^*_j$
($1 \le j \le r^*$), and the values associated with different
$\mathcal{U}^*_j$ and the $\mathbf{w}^*\mathbf{u}$,
$\mathbf{u} \in \mathcal{U}^*_0$, are all distinct. Thus
$\inf_{\mathbf{w} \in \mathcal{W}} |\mathbf{w}\mathbf{u} - \mathbf{w}\mathbf{u}'| \ge \delta$
for some $\delta > 0$, for all $\mathbf{u} \ne \mathbf{u}'$ that do not belong
to a same $\mathcal{U}^*_j$, $j = 1, \ldots, r^*$. Therefore, the partition
$\{\mathcal{V}_1, \ldots, \mathcal{V}_N\} = \{\{\mathbf{u}\}, \mathbf{u} \in \mathcal{U}^*_0, \mathcal{U}^*_1, \ldots, \mathcal{U}^*_{r^*}\}$
satisfies $d(\mathbf{w}) \ge \delta$, $\forall \mathbf{w} \in \mathcal{W}$.
We then refine this partition by splitting one of the sets
$\mathcal{U}^*_j$, $j = 1, \ldots, r^*$, into two subsets. The splitting rule is as
follows: for each $\mathcal{U}^*_j$, arrange the $\mathbf{w}\mathbf{u}$,
$\mathbf{u} \in \mathcal{U}^*_j$, in ascending
order and look for the maximum gap between two consecutive
values. The set $\mathcal{U}^*_j$ that produces the largest gap will be split
and the splitting is done at the gap. For
$\mathbf{w} \in \mathcal{W} \setminus \mathcal{W}'$, this
maximum gap can be bounded below by a positive number
$\delta'$ (noting that there is only a finite number of elements in
each $\mathcal{U}^*_j$); hence for the refined partition,
$d(\mathbf{w}) \ge \min(\delta, \delta')$.
Of course, the partition constructed this way depends on $\mathbf{w}$,
but there can be only a finite number of possible partitions.
Hence, one can find a finite number of subsets
$\mathcal{W}_1, \ldots, \mathcal{W}_q$
which cover $\mathcal{W} \setminus \mathcal{W}'$, each of which is associated
with a partition of $\mathcal{U}$ such that the corresponding $d(\mathbf{w})$
is bounded below by $\min(\delta, \delta')$ for all $\mathbf{w}$ in this
subset. In the following we shall restrict $\mathbf{w}$ to one such subset,
$\mathcal{W}_p$ say, and we denote
by $\mathcal{V}_1, \ldots, \mathcal{V}_N$ the associated partition.
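The splitting rule described above can be sketched in a few lines. This is an illustration only, with hypothetical names and data: `split_at_largest_gap` is not from the paper, the groups stand for the sets $\mathcal{U}^*_j$ that contain at least two points, and the vector `w` in the example is an arbitrary direction that is not normalized.

```python
def split_at_largest_gap(w, groups):
    """Refine a partition of support points by a single split.

    `groups` is a list of lists of points u (tuples). Within each group the
    projections w.u are sorted and the widest gap between two consecutive
    values is located. The group holding the overall widest gap is split at
    that gap. Returns (refined_groups, gap).
    """
    def proj(u):
        return sum(wi * ui for wi, ui in zip(w, u))

    best = None  # (gap, group index, sorted group, split position)
    for g, group in enumerate(groups):
        ordered = sorted(group, key=proj)
        for k in range(len(ordered) - 1):
            gap = proj(ordered[k + 1]) - proj(ordered[k])
            if best is None or gap > best[0]:
                best = (gap, g, ordered, k + 1)
    if best is None:  # every group is a singleton: nothing to split
        return [list(grp) for grp in groups], 0.0
    gap, g, ordered, cut = best
    refined = [list(grp) for i, grp in enumerate(groups) if i != g]
    refined += [ordered[:cut], ordered[cut:]]
    return refined, gap

# Hypothetical K = 3 example with two groups of nearly coinciding projections.
groups = [[(0, 0, 0), (1, 1, 0), (2, 2, 0.1)], [(5, 0, 0), (5, 1, 1)]]
refined, gap = split_at_largest_gap((1.0, -0.9, 0.0), groups)
print(refined, gap)
```

Only one set is split per refinement, which mirrors the argument in the text: the largest within-set gap is the quantity bounded below by $\delta'$.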

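We now apply Lemma 1 with $\pi_n$, $K_n$, $n = 1, \ldots, N$,
defined as above and with the sets $\Omega_n$ defined by

$$\Omega_n = \Big\{y : \min_{\mathbf{u} \in \mathcal{V}_n} |y - \mathbf{w}\mathbf{u}| < d(\mathbf{w})/2\Big\}.$$

Then we have, writing $d$ in place of $d(\mathbf{w})$ for short,

$$\epsilon_n \le 1 - \int_{-d/(2\sigma)}^{d/(2\sigma)} \Phi(x)\, dx = \operatorname{Erfc}\Big(\frac{d}{2\sqrt{2}\sigma}\Big)$$

and

$$\epsilon'_n = \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\pi_n} \int_{\mathbb{R} \setminus \Omega_n} \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big) \log\Big[\frac{\sup K_n}{K_n(y)}\Big]\, dy.$$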



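In each term in the sum in that last right hand side, one applies
the bound

$$\frac{\sup K_n}{K_n(y)} \le \frac{\sigma \sup K_n}{[P(\mathbf{u})/\pi_n]\, \Phi[(y - \mathbf{w}\mathbf{u})/\sigma]}$$

which yields

$$\epsilon'_n \le \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\pi_n} \int_{\mathbb{R} \setminus \Omega_n} \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big) \log\Big[\frac{\sigma \sup K_n}{[P(\mathbf{u})/\pi_n]\, \Phi[(y - \mathbf{w}\mathbf{u})/\sigma]}\Big]\, dy$$

$$\le \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\pi_n} \int_{|x| \ge d/(2\sigma)} \Phi(x) \log\Big[\frac{\sup(\sigma K_n)}{[P(\mathbf{u})/\pi_n]\, \Phi(x)}\Big]\, dx$$

$$= \Big[\log \sup(\sigma K_n) - \sum_{\mathbf{u} \in \mathcal{V}_n} \frac{P(\mathbf{u})}{\pi_n} \log \frac{P(\mathbf{u})}{\pi_n}\Big] \operatorname{Erfc}\Big(\frac{d}{2\sqrt{2}\sigma}\Big) - \int_{|x| \ge d/(2\sigma)} \Phi(x) \log \Phi(x)\, dx.$$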



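Therefore, putting
$h_n = -\sum_{\mathbf{u} \in \mathcal{V}_n} [P(\mathbf{u})/\pi_n] \log[P(\mathbf{u})/\pi_n]$
and noting that $\sup(\sigma K_n) \le \sup \Phi = (2\pi)^{-1/2}$ and that
$-\log\Phi(x) = \tfrac{1}{2}\log(2\pi) + x^2/2$, one gets

$$\sum_{n=1}^N \pi_n \Big[\log\Big(\frac{\max_{1 \le m \le N} \sup K_m}{\pi_n \sup K_n}\Big) + 1\Big] \epsilon_n + \sum_{n=1}^N \pi_n \epsilon'_n
\le \sum_{n=1}^N \pi_n \Big[\log\Big(\frac{\max_{1 \le m \le N} \sup K_m}{\pi_n \sup K_n}\Big) + 1 + h_n\Big] \operatorname{Erfc}\Big(\frac{d}{2\sqrt{2}\sigma}\Big)
+ \int_{|x| \ge d/(2\sigma)} \frac{x^2}{2}\, \Phi(x)\, dx.$$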



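Since $d = d(\mathbf{w}) \ge \min(\delta, \delta')$,
$\forall \mathbf{w} \in \mathcal{W}_p$, the last inequality
shows that for any $\eta > 0$,

$$H(p) \ge \sum_{n=1}^N \pi_n H(K_n) + h(\pi) - \eta, \qquad \forall \mathbf{w} \in \mathcal{W}_p,$$

for $\sigma$ small enough. On the other hand, since $\log x \le x - 1$,

$$\int \frac{1}{\sigma}\, \Phi\Big(\frac{y - \mathbf{w}\mathbf{u}}{\sigma}\Big) \log\Big[\frac{K_n(y)}{\Phi[(y - \mathbf{w}\mathbf{u})/\sigma]/\sigma}\Big]\, dy \le 0.$$

Multiplying both members of the above inequality by
$P(\mathbf{u})/\pi_n$ and summing up with respect to
$\mathbf{u} \in \mathcal{V}_n$, one gets
$H(\Phi) + \log\sigma - H(K_n) \le 0$. Therefore

$$H(p) \ge H(\Phi) + \log\sigma + h(\pi) - \eta.$$

But by construction $h(\pi) > h(\mathbf{w}^*\mathbf{U})$ (see the proof of
Lemma 2); therefore, taking $\eta < h(\pi) - h(\mathbf{w}^*\mathbf{U})$, one sees
that for $\sigma$ small enough
$H(\mathbf{w}\mathbf{S}) = H(p) > H(\mathbf{w}^*\mathbf{S})$ for all
$\mathbf{w} \in \mathcal{W}_p$. Since this is true for all
$p = 1, \ldots, q$, we conclude
as before that $H(\mathbf{w}\mathbf{S})$ admits a local minimum in
$\mathcal{W}'$. ∎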
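The Gaussian tail quantity $\operatorname{Erfc}(d/(2\sqrt{2}\sigma))$ that controls $\epsilon_n$ in the proof of Lemma 3 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the values of `d` and `sigma` are arbitrary, and the central mass is approximated by a midpoint Riemann sum.

```python
import math

def eps_bound(d, sigma):
    """Closed form Erfc(d / (2 sqrt(2) sigma)) for the Gaussian tail mass."""
    return math.erfc(d / (2 * math.sqrt(2) * sigma))

def tail_by_quadrature(d, sigma, n=20000):
    """1 minus the standard Gaussian mass on [-d/(2 sigma), d/(2 sigma)], by midpoint sum."""
    a = d / (2 * sigma)
    h = 2 * a / n
    inside = sum(math.exp(-0.5 * (-a + (k + 0.5) * h) ** 2) for k in range(n))
    inside *= h / math.sqrt(2 * math.pi)
    return 1.0 - inside

for sigma in (0.1, 0.05, 0.025):
    print(f"sigma = {sigma:<6} Erfc bound = {eps_bound(0.5, sigma):.3e}")
```

For a fixed separation `d`, the bound decays very quickly as `sigma` decreases, which is what allows the error term to be made smaller than any $\eta > 0$.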



Frédéric Vrins was born in Uccle, Belgium, in
1979. He received the MS degree in mechatronics
engineering and the DEA degree in Applied Sciences
from the Université catholique de Louvain (Belgium)
in 2002 and 2004, respectively. He is currently working
towards the PhD degree in the UCL Machine
Learning Group. His research interests are blind
source separation, independent component analysis,
Shannon and Renyi entropies, mutual information
and information theory in adaptive signal processing.
He is a member of the program committee of ICA 2006.



Dinh-Tuan Pham was born in Hanoi, Vietnam,
on February 10, 1945. He graduated from the
Engineering School of Applied Mathematics and
Computer Science (ENSIMAG) of the Polytechnic
Institute of Grenoble in 1968. He received the Ph.D.
degree in Statistics in 1975 from the University of
Grenoble. He was a Postdoctoral Fellow at Berkeley
(Department of Statistics) in 1977-1978 and a Visiting
Professor at Indiana University (Department of
Mathematics) at Bloomington in 1979-1980. He is
currently Director of Research at the French Centre
National de la Recherche Scientifique (C.N.R.S.). His research interests include time
series analysis, signal modelling, blind source separation, nonlinear (particle)
filtering and biomedical signal processing.



Michel Verleysen was born in 1965 in Belgium.
He received the M.S. and Ph.D. degrees in electrical
engineering from the Université catholique de Louvain
(Belgium) in 1987 and 1992, respectively. He
was an Invited Professor at the Swiss E.P.F.L. (Ecole
Polytechnique Fédérale de Lausanne, Switzerland)
in 1992, at the Université d'Evry Val d'Essonne
(France) in 2001, and at the Université Paris I-
Panthéon-Sorbonne in 2002, 2003 and 2004. He
is now Research Director of the Belgian F.N.R.S.
(Fonds National de la Recherche Scientifique) and
Lecturer at the Université catholique de Louvain. He is editor-in-chief of
the Neural Processing Letters journal and chairman of the annual ESANN
conference (European Symposium on Artificial Neural Networks); he is
associate editor of the IEEE Trans. Neural Networks journal, and member of the
editorial board and program committee of several journals and conferences on
neural networks and learning. He is author or co-author of about 200 scientific
papers in international journals and books or communications to conferences
with reviewing committee. He is the co-author of the scientific popularization
book on artificial neural networks in the series "Que Sais-Je?", in French. His
research interests include artificial neural networks, self-organization, time-series
forecasting, nonlinear statistics, adaptive signal processing, information-theoretic
learning and biomedical data and signal analysis.



